Unveiling the Secrets of Wine Quality

Bypassing traditional tasting methods, this project employs data analysis to predict the quality of wine based on its chemical features.

🍷 1. Introduction about the Data Set

📖 1.1 General Information:

  • Provided by: The UCI Machine Learning Repository.
  • Donated by: Paulo Cortez, Antonio Cerdeira, Fernando Almeida, Telmo Matos, and Jose Reis.
    • 📜 Their paper: Modeling wine preferences by data mining from physicochemical properties.
    • 🧐 Main idea: Using data mining to understand how various factors influence wine quality, offering insights into wine production and certification.
    • ⚒️ Approach: Support Vector Machines (SVM), Neural Networks (NN), and Multiple Regression (MR) techniques.
    • 🧰 Conclusion:
      • For assessing wine quality, the Support Vector Machine (SVM) method outperforms other techniques in accuracy, especially for white wines.
      • Alcohol level is a key factor in determining wine quality. Citric acid and residual sugar are more significant in white wines, whereas sulphates are highly important in both types.

🍇 1.2 Info about the Wine:

  • Types: Both white and red wines from the Vinho Verde region in northwestern Portugal 🇵🇹.
  • Production: Represents 15% of Portuguese production.

📊 1.3 Info about the Datasets:

  • Wines: 1599 red and 4898 white samples.
  • Collection:
    • ⏳ Timeframe: May 2004 to February 2007.
    • 🏷️ Type: Only protected designation of origin samples tested by the CVRVV (Comissão de Viticultura da Região dos Vinhos Verdes), an organization focused on improving the quality and marketing of vinho verde.
  • Quality Assessment:
    • Rated by at least three sensory assessors (blind tastings), on a scale from 0 (very bad) to 10 (excellent). The final score is the median of these ratings.
  • Chemical Features Tested:
    • 🧪 Data recorded by iLab, a computerized system managing wine sample testing.
    • Tests include density, alcohol, pH values, etc.
  • Limitation:
    • Lack of Temporal Information:
      • We cannot analyze how wine quality varies across years, so we also cannot relate weather conditions to wine quality.
    • Lack of Brand and Public Preference Data:
      • We are unable to establish a direct link between wine quality attributes and consumer preferences or sales performance.

2. Research Questions and Motivations

2.1 Research Questions

Our research questions extend prior studies by:

  • 🎯 Focusing on Classification: Utilizing advanced models like logistic regression and Random Forest for classifying wine quality tiers, assessing their accuracy, and pinpointing crucial quality influencers.

  • 🛠️ Model Comparison Pipeline: Developing a systematic pipeline to contrast various models. This includes tuning hyperparameters and evaluating performance scores.

  • 🍇 Quality Wine Recipes: Crafting formulas for both top-quality and poor-quality wines. These models aim to avert the production of low-quality wines and spotlight the unique attributes of top-tier wines.

  • 🔍 Deep Dive with PCA: Investigating nuances in high-rated wines and applying Principal Component Analysis (PCA) for a more thorough exploration, surpassing traditional data mining approaches.

2.2 Motivations:

  • 🇫🇷 Cultural Significance: Residing in France, a nation celebrated for its wine tradition, we seek to deepen our understanding of wine. This analysis fosters a greater appreciation of this heritage.

  • 📊 Analytical Depth: Leveraging data-driven methods to explore wine quality nuances. This exploration will enhance our analytical skills while shedding light on hidden characteristics within wines.

  • 🍾 Enhancing Wine Production: Providing actionable insights for quality improvement through advanced statistical and machine learning techniques.

3. Data Analysis

In [1]:
from ucimlrepo import fetch_ucirepo 
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from scipy.stats import zscore

3.1 Extract Data: Reading from CSV Files

In [2]:
# These csv files are downloaded from the UCI website.

df_white_wine = pd.read_csv("data/winequality-white.csv", sep=";")

df_red_wine = pd.read_csv("data/winequality-red.csv",sep=";")

df_red_wine
df_white_wine
Out[2]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
0 7.0 0.27 0.36 20.7 0.045 45.0 170.0 1.00100 3.00 0.45 8.8 6
1 6.3 0.30 0.34 1.6 0.049 14.0 132.0 0.99400 3.30 0.49 9.5 6
2 8.1 0.28 0.40 6.9 0.050 30.0 97.0 0.99510 3.26 0.44 10.1 6
3 7.2 0.23 0.32 8.5 0.058 47.0 186.0 0.99560 3.19 0.40 9.9 6
4 7.2 0.23 0.32 8.5 0.058 47.0 186.0 0.99560 3.19 0.40 9.9 6
... ... ... ... ... ... ... ... ... ... ... ... ...
4893 6.2 0.21 0.29 1.6 0.039 24.0 92.0 0.99114 3.27 0.50 11.2 6
4894 6.6 0.32 0.36 8.0 0.047 57.0 168.0 0.99490 3.15 0.46 9.6 5
4895 6.5 0.24 0.19 1.2 0.041 30.0 111.0 0.99254 2.99 0.46 9.4 6
4896 5.5 0.29 0.30 1.1 0.022 20.0 110.0 0.98869 3.34 0.38 12.8 7
4897 6.0 0.21 0.38 0.8 0.020 22.0 98.0 0.98941 3.26 0.32 11.8 6

4898 rows × 12 columns
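
The `sep=";"` argument matters because the files use semicolons rather than commas as field separators. A minimal sketch on an in-memory sample instead of the real files; the two rows are taken from the white-wine preview above, and the quoted header layout is an assumption about the raw CSV:

```python
import io
import pandas as pd

# Two rows from the white-wine preview, reduced to four columns.
# The quoted, semicolon-separated header is an assumed layout of the raw file.
raw_text = (
    '"fixed acidity";"volatile acidity";"alcohol";"quality"\n'
    "7.0;0.27;8.8;6\n"
    "6.3;0.30;9.5;6\n"
)

# Without sep=";" pandas would read each line as a single comma-free column.
sample_df = pd.read_csv(io.StringIO(raw_text), sep=";")
print(sample_df.dtypes)
```

With the correct separator the feature columns parse as floats and `quality` as an integer, matching the dtypes reported later in the notebook.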

In [3]:
df_white_wine.head()
Out[3]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
0 7.0 0.27 0.36 20.7 0.045 45.0 170.0 1.0010 3.00 0.45 8.8 6
1 6.3 0.30 0.34 1.6 0.049 14.0 132.0 0.9940 3.30 0.49 9.5 6
2 8.1 0.28 0.40 6.9 0.050 30.0 97.0 0.9951 3.26 0.44 10.1 6
3 7.2 0.23 0.32 8.5 0.058 47.0 186.0 0.9956 3.19 0.40 9.9 6
4 7.2 0.23 0.32 8.5 0.058 47.0 186.0 0.9956 3.19 0.40 9.9 6
In [4]:
df_red_wine.head()
Out[4]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
0 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5
1 7.8 0.88 0.00 2.6 0.098 25.0 67.0 0.9968 3.20 0.68 9.8 5
2 7.8 0.76 0.04 2.3 0.092 15.0 54.0 0.9970 3.26 0.65 9.8 5
3 11.2 0.28 0.56 1.9 0.075 17.0 60.0 0.9980 3.16 0.58 9.8 6
4 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5
In [5]:
wine_lists = [df_red_wine, df_white_wine]

df_white_wine.wine_type = "White Wine"
df_red_wine.wine_type = "Red Wine"

def get_wine_str(wine_type_df):
    return getattr(wine_type_df, 'wine_type', "Unknown Wine")
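
A plain Python attribute set on a DataFrame is not tracked by pandas and may not survive operations such as `copy()`, which is why the label is re-assigned after each copy later in the notebook. One possible alternative, sketched here under the assumption that a recent pandas version is available, is the `DataFrame.attrs` dictionary (marked experimental by pandas). The helper name `get_wine_label` and the miniature frame are hypothetical:

```python
import pandas as pd

# Hypothetical miniature frame standing in for df_red_wine.
df = pd.DataFrame({"alcohol": [9.4, 9.8], "quality": [5, 5]})

# DataFrame.attrs is a dict for dataset-level metadata (experimental in pandas).
df.attrs["wine_type"] = "Red Wine"

def get_wine_label(wine_df):
    # Same fallback behaviour as get_wine_str, but reading from attrs.
    return wine_df.attrs.get("wine_type", "Unknown Wine")

copied = df.copy()
print(get_wine_label(copied))
print(get_wine_label(pd.DataFrame({"x": [1]})))
```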

3.2 Transformation

We create QQ plots to check whether each feature follows a normal distribution.
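
A QQ plot can also be summarized numerically. When `scipy.stats.probplot` is called without a `plot` argument, it returns the correlation coefficient `r` of the least-squares line through the QQ points; values close to 1 mean the points lie near the line. The sketch below uses synthetic samples, not the wine data, and the helper name `qq_correlation` is hypothetical:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
normal_sample = rng.normal(loc=10.0, scale=1.0, size=1000)     # close to normal
skewed_sample = rng.lognormal(mean=0.0, sigma=1.0, size=1000)  # strongly right-skewed

def qq_correlation(values):
    # probplot returns ((theoretical quantiles, ordered values), (slope, intercept, r)).
    (_, _), (_, _, r) = stats.probplot(values, dist="norm")
    return r

r_normal = qq_correlation(normal_sample)
r_skewed = qq_correlation(skewed_sample)
print(f"normal sample: r={r_normal:.3f}, skewed sample: r={r_skewed:.3f}")
```

The skewed sample yields a clearly lower `r`, which is the numeric counterpart of points bending away from the reference line.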

In [6]:
def check_numeric_columns(wine_type_df):
    return wine_type_df.select_dtypes(include=[np.number]).columns
In [7]:
## Q-Q 
# boxplot - log scale 



def create_qq_plot(wine_type_df):
    wine_type = get_wine_str(wine_type_df)

    # Select only the numerical columns from the DataFrame
    numeric_columns = check_numeric_columns(wine_type_df)

    # Set up the matplotlib figure and axes for a 3x4 grid (12 numeric columns)
    fig, axs = plt.subplots(3, 4, figsize=(20, 15))  # Adjust the size as needed

    # Flatten the array of axes to make it easier to iterate over
    axs = axs.flatten()

    # Loop over the numerical columns and create a Q-Q plot for each
    for i, column in enumerate(numeric_columns):  
        data = wine_type_df[column]
        stats.probplot(data, dist="norm", plot=axs[i])
        axs[i].set_title(column)
        axs[i].set_xlabel('')
        axs[i].set_ylabel('')

    # Adjust layout to prevent overlap
    fig.suptitle(f"QQ Plots for {wine_type}", fontsize=16)
    plt.tight_layout(rect=[0, 0.03, 1, 0.95])
    plt.show()
In [8]:
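def create_plots(plot_function, dfs=wine_lists):
    # Apply plot_function to each DataFrame in turn.
    # The parameter is named dfs to avoid shadowing the built-in list.
    for wine_df in dfs:
        plot_function(wine_df)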
In [9]:
create_plots(create_qq_plot, wine_lists)
[Figure: QQ Plots for Red Wine]
[Figure: QQ Plots for White Wine]

Observations from QQ Plots of Red and White Wine Datasets:

  1. Deviations in the Lower Quantiles:

    • Red Wine: Residual Sugar and Alcohol deviate from the reference line in the lower quantiles.
    • White Wine: As with Red Wine, Residual Sugar and Alcohol deviate in the lower quantiles.
  2. Deviations in the Upper Quantiles:

    • Red Wine: Free Sulfur Dioxide, Chlorides, and Sulphates deviate in the upper quantiles.
    • White Wine: Volatile Acidity, Chlorides, and Sulphates deviate in the upper quantiles.
  3. Implications for Data Processing:

    • Both patterns point to skewed, non-normal features. The skewness coefficients computed in the next step are all positive, so these deviations reflect right (positive) skewness of varying strength rather than left skewness.
    • Logarithmic or Box-Cox transformations may reduce the skewness and bring these features closer to a normal distribution.
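
To make the two candidate transformations concrete, the sketch below applies a log transform and a Box-Cox transform to synthetic right-skewed values (not the wine data) and compares the skewness before and after. `scipy.stats.boxcox` requires strictly positive input and also returns the fitted lambda:

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(42)
# Synthetic, strictly positive, right-skewed values (illustration only).
values = pd.Series(rng.lognormal(mean=0.0, sigma=0.8, size=2000))

log_values = np.log(values)
# boxcox returns the transformed data and the lambda that maximizes the log-likelihood.
boxcox_values, fitted_lambda = stats.boxcox(values.to_numpy())

skew_raw = values.skew()
skew_log = log_values.skew()
skew_boxcox = pd.Series(boxcox_values).skew()
print(f"raw={skew_raw:.2f}, log={skew_log:.2f}, box-cox={skew_boxcox:.2f}")
```

Box-Cox includes the log transform as the special case lambda = 0, so on data like this the two results are usually close.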
In [10]:
def calculate_skewness_coefficient(wine_type_df):

    print(f"\nThe skewness coefficient of {get_wine_str(wine_type_df)}: \n")

    numerical_columns = wine_type_df.select_dtypes(include=['number']).columns

    for column in numerical_columns:  
        skewness = round(wine_type_df[column].skew(),2)
        print(f"{column}: {skewness}")
In [11]:
create_plots(calculate_skewness_coefficient, wine_lists)
The skewness coefficient of Red Wine: 

fixed acidity: 0.98
volatile acidity: 0.67
citric acid: 0.32
residual sugar: 4.54
chlorides: 5.68
free sulfur dioxide: 1.25
total sulfur dioxide: 1.52
density: 0.07
pH: 0.19
sulphates: 2.43
alcohol: 0.86
quality: 0.22

The skewness coefficient of White Wine: 

fixed acidity: 0.65
volatile acidity: 1.58
citric acid: 1.28
residual sugar: 1.08
chlorides: 5.02
free sulfur dioxide: 1.41
total sulfur dioxide: 0.39
density: 0.98
pH: 0.46
sulphates: 0.98
alcohol: 0.49
quality: 0.16
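
For reference, `Series.skew()` reports the sample-size-adjusted skewness rather than the raw moment ratio. The sketch below recomputes it by hand and compares it with pandas; the helper name `adjusted_skewness` and the sample values are illustrative only:

```python
import numpy as np
import pandas as pd

def adjusted_skewness(values):
    x = np.asarray(values, dtype=float)
    n = x.size
    deviations = x - x.mean()
    m2 = np.mean(deviations ** 2)   # second central moment
    m3 = np.mean(deviations ** 3)   # third central moment
    g1 = m3 / m2 ** 1.5             # biased moment-based skewness
    # Small-sample correction (adjusted Fisher-Pearson coefficient).
    return g1 * np.sqrt(n * (n - 1)) / (n - 2)

# Illustrative values with one large observation in the right tail.
sample = [1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1]
manual = adjusted_skewness(sample)
from_pandas = pd.Series(sample).skew()
print(f"manual={manual:.4f}, pandas={from_pandas:.4f}")
```

A positive value means the right tail is longer than the left, which is the case for every feature in the two listings above.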

Observations from the skewness coefficient

  1. White Wine:
    • Chlorides (5.02): High skewness.
    • Volatile Acidity (1.58), Citric Acid (1.28), Residual Sugar (1.08), Free Sulfur Dioxide (1.41): Moderate skewness.
  2. Red Wine:
    • Residual Sugar (4.54) and Chlorides (5.68): High skewness.
    • Free Sulfur Dioxide (1.25), Total Sulfur Dioxide (1.52), Sulphates (2.43): Moderate skewness.

We apply a log transformation to the features with high or moderate skewness.

In [12]:
white_wine_log_columns = [
    'chlorides',
    'volatile acidity',
    'citric acid',
    'residual sugar',
    'free sulfur dioxide'
]

red_wine_log_columns = [
    'residual sugar',
    'chlorides',
    'free sulfur dioxide',
    'total sulfur dioxide',
    'sulphates'
]

log_red_wine_df = df_red_wine.copy()
log_white_wine_df = df_white_wine.copy()

log_red_wine_df[red_wine_log_columns] = np.log(log_red_wine_df[red_wine_log_columns] + 0.001)
log_white_wine_df[white_wine_log_columns] = np.log(log_white_wine_df[white_wine_log_columns]+ 0.001)

log_red_wine_df.wine_type = "Red Wine(Log)"
log_white_wine_df.wine_type = "White Wine(Log)"
In [13]:
log_dfs = [log_red_wine_df, log_white_wine_df]

create_plots(create_qq_plot, log_dfs)
create_plots(calculate_skewness_coefficient, log_dfs)
[Figure: QQ Plots for Red Wine(Log)]
[Figure: QQ Plots for White Wine(Log)]
The skewness coefficient of Red Wine(Log): 

fixed acidity: 0.98
volatile acidity: 0.67
citric acid: 0.32
residual sugar: 1.81
chlorides: 1.79
free sulfur dioxide: -0.23
total sulfur dioxide: -0.08
density: 0.07
pH: 0.19
sulphates: 0.92
alcohol: 0.86
quality: 0.22

The skewness coefficient of White Wine(Log): 

fixed acidity: 0.65
volatile acidity: 0.14
citric acid: -5.56
residual sugar: -0.16
chlorides: 1.19
free sulfur dioxide: -0.94
total sulfur dioxide: 0.39
density: 0.98
pH: 0.46
sulphates: 0.98
alcohol: 0.49
quality: 0.16

Observations based on the first log transformation:

  1. Red Wine (Original vs Log-Transformed):

    • Original: Residual sugar 4.54, Chlorides 5.68.
    • Log-Transformed: Residual sugar 1.81, Chlorides 1.79.
    • Free sulfur dioxide changed from positive (1.25) to slightly negative skewness (-0.23).
  2. White Wine (Original vs Log-Transformed):

    • Original: Volatile acidity 1.58, Citric acid 1.28.
    • Log-Transformed: Volatile acidity 0.14, Citric acid -5.56 (over-correction).
    • Residual sugar reduced from 1.08 to -0.16, Chlorides from 5.02 to 1.19.
  3. Minimal Impact on Some Variables:

    • Alcohol and quality were not log-transformed, so their skewness is unchanged (red: 0.86 and 0.22; white: 0.49 and 0.16).
  4. Avoid Log Transformation For:

    • Red Wine: Free sulfur dioxide and Total sulfur dioxide.
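    • White Wine: Citric acid and Residual sugar.
    • The next cell additionally leaves Sulphates (red) and Free sulfur dioxide (white) untransformed, keeping only two log-transformed features per dataset.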
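
One plausible explanation for the citric acid over-correction, assuming the column contains exact zeros, is the small offset: `np.log(0 + 0.001)` is about -6.9, far below the log of any typical value, which creates a long left tail. The sketch below shows the effect on illustrative values (not the wine data) and compares it with `np.log1p`, a different transformation that maps zero to zero:

```python
import numpy as np
import pandas as pd

# Illustrative values with exact zeros, loosely mimicking a citric-acid-like column.
values = pd.Series([0.0, 0.0, 0.2, 0.25, 0.3, 0.3, 0.35, 0.4, 0.5, 0.7])

log_with_offset = np.log(values + 0.001)  # zeros become extreme negative values
log1p_values = np.log1p(values)           # log(1 + x): zeros stay at zero

skew_offset = log_with_offset.skew()
skew_log1p = log1p_values.skew()
print(f"log(x + 0.001): {skew_offset:.2f}, log1p(x): {skew_log1p:.2f}")
```

Note that `log1p` compresses small values much less than a plain log, so it is a milder correction and not a drop-in replacement.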
In [14]:
white_wine_update_log_columns = [
    'chlorides',
    'volatile acidity'
]
red_wine_update_log_columns = [
    'residual sugar',
    'chlorides'
]

log_red_update_wine_df = df_red_wine.copy()
log_white_update_wine_df = df_white_wine.copy()

log_red_update_wine_df[red_wine_update_log_columns] = np.log(log_red_update_wine_df[red_wine_update_log_columns] + 0.001)
log_white_update_wine_df[white_wine_update_log_columns] = np.log(log_white_update_wine_df[white_wine_update_log_columns]+ 0.001)

log_red_update_wine_df.wine_type = "Red Wine(Second Log)"
log_white_update_wine_df.wine_type = "White Wine(Second Log)"

log_update_dfs = [log_white_update_wine_df, log_red_update_wine_df]

create_plots(create_qq_plot, log_update_dfs)
create_plots(calculate_skewness_coefficient, log_update_dfs)
[Figure: QQ Plots for White Wine(Second Log)]
[Figure: QQ Plots for Red Wine(Second Log)]
The skewness coefficient of White Wine(Second Log): 

fixed acidity: 0.65
volatile acidity: 0.14
citric acid: 1.28
residual sugar: 1.08
chlorides: 1.19
free sulfur dioxide: 1.41
total sulfur dioxide: 0.39
density: 0.98
pH: 0.46
sulphates: 0.98
alcohol: 0.49
quality: 0.16

The skewness coefficient of Red Wine(Second Log): 

fixed acidity: 0.98
volatile acidity: 0.67
citric acid: 0.32
residual sugar: 1.81
chlorides: 1.79
free sulfur dioxide: 1.25
total sulfur dioxide: 1.52
density: 0.07
pH: 0.19
sulphates: 2.43
alcohol: 0.86
quality: 0.22

Observations:

  1. White Wine:
    • Log transformation significantly reduced skewness in volatile acidity (from 1.58 to 0.14) and chlorides (from 5.02 to 1.19).
  2. Red Wine:
    • Effective reduction in skewness for residual sugar (from 4.54 to 1.81) and chlorides (from 5.68 to 1.79).
  3. Conclusion:
    • The second round, which log-transforms only two features per dataset, reduces the highest skewness values in both the Red and White Wine datasets while leaving the other features unchanged.

3.3 Clean Data

3.3.1 Check for missing values

We are going to check if there are empty values.

In [15]:
def check_na(wine_type_df):
    print(f'{get_wine_str(wine_type_df)}')
    print(wine_type_df.info())
    print(wine_type_df.isnull().sum())

create_plots(check_na, log_update_dfs)
White Wine(Second Log)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4898 entries, 0 to 4897
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         4898 non-null   float64
 1   volatile acidity      4898 non-null   float64
 2   citric acid           4898 non-null   float64
 3   residual sugar        4898 non-null   float64
 4   chlorides             4898 non-null   float64
 5   free sulfur dioxide   4898 non-null   float64
 6   total sulfur dioxide  4898 non-null   float64
 7   density               4898 non-null   float64
 8   pH                    4898 non-null   float64
 9   sulphates             4898 non-null   float64
 10  alcohol               4898 non-null   float64
 11  quality               4898 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 459.3 KB
None
fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
dtype: int64
Red Wine(Second Log)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 150.0 KB
None
fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
dtype: int64

Observations from Checking Missing Values:

  1. There are no missing data in either dataset.
  2. All feature columns are of the float type, and the target column is an integer.

3.3.2 Outlier Analysis

We will create box plots to understand the outliers.
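
By default, a box plot draws points that lie more than 1.5 times the interquartile range (IQR) beyond the quartiles as individual outliers. The same rule can be applied numerically; the sketch below uses a small illustrative series, and the helper name `iqr_outlier_mask` is hypothetical:

```python
import pandas as pd

def iqr_outlier_mask(series, whisker=1.5):
    # Flag values outside [Q1 - whisker * IQR, Q3 + whisker * IQR].
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - whisker * iqr
    upper = q3 + whisker * iqr
    return (series < lower) | (series > upper)

sample = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], name="illustrative feature")
mask = iqr_outlier_mask(sample)
print(sample[mask].tolist())
```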

In [37]:
def create_box_plot(wine_type_df):
    feature_columns = wine_type_df.drop(columns=['quality']).columns.tolist()
    fig, axs = plt.subplots(3, 4, figsize=(20, 15))  

    # Flatten the array of axes to make it easier to iterate over
    axs = axs.flatten()

    for i, column in enumerate(feature_columns): 
        sns.boxplot(x='quality', y=column, hue='quality', data=wine_type_df, ax=axs[i], palette='dark:.8', legend=False)

    fig.suptitle(f'Box Plots for {get_wine_str(wine_type_df)}', fontsize=16)
    plt.show()
In [38]:
create_plots(create_box_plot, log_update_dfs)
[Figure: Box Plots for White Wine(Second Log)]
[Figure: Box Plots for Red Wine(Second Log)]

Observations from Box Plots

  1. It is hard to decide from these plots alone whether outliers should be removed.
  2. To better understand the outliers, we create separate QQ plots for wines grouped into three quality categories: poor (3-4), middle (5-7), and good (8-9).
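
The three categories can be produced with `pd.cut`. A minimal sketch on a handful of scores, using the same bin edges and labels as the notebook cell that follows:

```python
import pandas as pd

bin_edges = [2, 4, 7, 10]
bin_labels = ["poor", "middle", "good"]

scores = pd.Series([3, 4, 5, 6, 7, 8, 9], name="quality")
# With the default right=True, the intervals are (2, 4], (4, 7], (7, 10].
categories = pd.cut(scores, bins=bin_edges, labels=bin_labels)
print(pd.DataFrame({"quality": scores, "quality_category": categories}))
```

Because the intervals are closed on the right, a score of 4 falls in the poor bin and a score of 7 in the middle bin.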
In [18]:
bin_edges = [2, 4, 7, 10]
bin_labels = ['poor', 'middle', 'good']

red_wine_quality_df = log_red_update_wine_df.copy()
white_wine_quality_df = log_white_update_wine_df.copy()

red_wine_quality_df['quality_category'] = pd.cut(red_wine_quality_df['quality'], bins=bin_edges, labels=bin_labels)
white_wine_quality_df['quality_category'] = pd.cut(white_wine_quality_df['quality'], bins=bin_edges, labels=bin_labels)
In [19]:
df_red_poor = red_wine_quality_df[red_wine_quality_df['quality_category'] == 'poor']
df_red_middle = red_wine_quality_df[red_wine_quality_df['quality_category'] == 'middle']
df_red_good = red_wine_quality_df[red_wine_quality_df['quality_category'] == 'good']

df_white_poor = white_wine_quality_df[white_wine_quality_df['quality_category'] == 'poor']
df_white_middle = white_wine_quality_df[white_wine_quality_df['quality_category'] == 'middle']
df_white_good = white_wine_quality_df[white_wine_quality_df['quality_category'] == 'good']


wine_quality_dfs = [
    df_white_poor,
    df_white_middle,
    df_white_good,
    df_red_poor,
    df_red_middle,
    df_red_good
]
In [20]:
df_white_poor.wine_type = 'White Wine Poor'
df_white_middle.wine_type = 'White Wine Middle'
df_white_good.wine_type = 'White Wine Good'
df_red_poor.wine_type = 'Red Wine Poor'
df_red_middle.wine_type = 'Red Wine Middle'
df_red_good.wine_type = 'Red Wine Good'
In [21]:
create_plots(create_qq_plot, wine_quality_dfs)
[Figure: QQ Plots for White Wine Poor]
[Figure: QQ Plots for White Wine Middle]
[Figure: QQ Plots for White Wine Good]
[Figure: QQ Plots for Red Wine Poor]
[Figure: QQ Plots for Red Wine Middle]
[Figure: QQ Plots for Red Wine Good]

To understand distribution asymmetries, we use the skewness coefficient to identify which variables deviate from normality within each quality category.

In [22]:
create_plots(calculate_skewness_coefficient, wine_quality_dfs)
The skewness coefficient of White Wine Poor: 

fixed acidity: 0.86
volatile acidity: 0.33
citric acid: 0.43
residual sugar: 1.07
chlorides: 1.17
free sulfur dioxide: 4.48
total sulfur dioxide: 1.07
density: 0.23
pH: 0.56
sulphates: 0.68
alcohol: 0.61
quality: -2.53

The skewness coefficient of White Wine Middle: 

fixed acidity: 0.62
volatile acidity: 0.06
citric acid: 1.38
residual sugar: 1.07
chlorides: 1.22
free sulfur dioxide: 0.64
total sulfur dioxide: 0.31
density: 1.01
pH: 0.47
sulphates: 0.99
alcohol: 0.52
quality: 0.19

The skewness coefficient of White Wine Good: 

fixed acidity: -0.4
volatile acidity: 0.07
citric acid: 0.52
residual sugar: 0.86
chlorides: 0.48
free sulfur dioxide: 1.47
total sulfur dioxide: 0.56
density: 1.14
pH: 0.06
sulphates: 0.98
alcohol: -0.92
quality: 5.8

The skewness coefficient of Red Wine Poor: 

fixed acidity: 0.89
volatile acidity: 0.65
citric acid: 1.56
residual sugar: 1.36
chlorides: 2.34
free sulfur dioxide: 1.38
total sulfur dioxide: 1.29
density: 0.44
pH: -0.31
sulphates: 4.47
alcohol: 0.56
quality: -1.91

The skewness coefficient of Red Wine Middle: 

fixed acidity: 1.0
volatile acidity: 0.5
citric acid: 0.28
residual sugar: 1.84
chlorides: 1.74
free sulfur dioxide: 1.25
total sulfur dioxide: 1.51
density: 0.09
pH: 0.21
sulphates: 2.31
alcohol: 0.86
quality: 0.52

The skewness coefficient of Red Wine Good: 

fixed acidity: 0.04
volatile acidity: 1.72
citric acid: -0.39
residual sugar: 1.38
chlorides: -1.17
free sulfur dioxide: 1.49
total sulfur dioxide: 1.34
density: -0.24
pH: 0.42
sulphates: 1.46
alcohol: -0.23
quality: 0

To pinpoint extreme outliers, we use a threshold of |Z-score| > 5, which focuses on the most anomalous data points in each quality subset.
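
The Z-score of a value is its distance from the column mean in units of the standard deviation. The sketch below, on synthetic data with one injected extreme value, checks that a manual computation matches `scipy.stats.zscore` (which uses the population standard deviation, `ddof=0`, by default) and counts the points beyond the threshold:

```python
import numpy as np
from scipy.stats import zscore

rng = np.random.default_rng(1)
# Synthetic feature: 500 well-behaved values plus one injected extreme value.
values = np.append(rng.normal(loc=0.0, scale=1.0, size=500), 50.0)

# Manual Z-score with the population standard deviation (ddof=0).
manual_z = (values - values.mean()) / values.std(ddof=0)
scipy_z = zscore(values)

extreme_mask = np.abs(scipy_z) > 5
print(f"points with |Z| > 5: {int(extreme_mask.sum())}")
```

One caveat when applying this per subset: with `ddof=0`, the largest attainable |Z| in a group of n rows is sqrt(n - 1), so a group with 26 or fewer rows can never exceed a threshold of 5.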

In [23]:
def calculate_z_score(wine_type_df):
    z_scores_df = wine_type_df.copy()

    print(f"\nZ Score of {get_wine_str(wine_type_df)}:\n")

    for col in wine_type_df.columns:
        if col != 'quality' and col != 'quality_category':
            z_scores_df[col + ' Z-score'] = zscore(wine_type_df[col])
            
            outliers = abs(z_scores_df[col + ' Z-score']) > 5
            print(f"Outliers in {col}: {outliers.sum()}")
    
    return z_scores_df
In [24]:
create_plots(calculate_z_score, wine_quality_dfs)
Z Score of White Wine Poor:

Outliers in fixed acidity: 0
Outliers in volatile acidity: 0
Outliers in citric acid: 0
Outliers in residual sugar: 0
Outliers in chlorides: 1
Outliers in free sulfur dioxide: 1
Outliers in total sulfur dioxide: 0
Outliers in density: 0
Outliers in pH: 0
Outliers in sulphates: 0
Outliers in alcohol: 0

Z Score of White Wine Middle:

Outliers in fixed acidity: 1
Outliers in volatile acidity: 0
Outliers in citric acid: 8
Outliers in residual sugar: 1
Outliers in chlorides: 6
Outliers in free sulfur dioxide: 2
Outliers in total sulfur dioxide: 0
Outliers in density: 3
Outliers in pH: 0
Outliers in sulphates: 2
Outliers in alcohol: 0

Z Score of White Wine Good:

Outliers in fixed acidity: 0
Outliers in volatile acidity: 0
Outliers in citric acid: 0
Outliers in residual sugar: 0
Outliers in chlorides: 0
Outliers in free sulfur dioxide: 0
Outliers in total sulfur dioxide: 0
Outliers in density: 0
Outliers in pH: 0
Outliers in sulphates: 0
Outliers in alcohol: 0

Z Score of Red Wine Poor:

Outliers in fixed acidity: 0
Outliers in volatile acidity: 0
Outliers in citric acid: 0
Outliers in residual sugar: 0
Outliers in chlorides: 0
Outliers in free sulfur dioxide: 0
Outliers in total sulfur dioxide: 0
Outliers in density: 0
Outliers in pH: 0
Outliers in sulphates: 1
Outliers in alcohol: 0

Z Score of Red Wine Middle:

Outliers in fixed acidity: 0
Outliers in volatile acidity: 0
Outliers in citric acid: 0
Outliers in residual sugar: 6
Outliers in chlorides: 12
Outliers in free sulfur dioxide: 1
Outliers in total sulfur dioxide: 2
Outliers in density: 0
Outliers in pH: 0
Outliers in sulphates: 7
Outliers in alcohol: 0

Z Score of Red Wine Good:

Outliers in fixed acidity: 0
Outliers in volatile acidity: 0
Outliers in citric acid: 0
Outliers in residual sugar: 0
Outliers in chlorides: 0
Outliers in free sulfur dioxide: 0
Outliers in total sulfur dioxide: 0
Outliers in density: 0
Outliers in pH: 0
Outliers in sulphates: 0
Outliers in alcohol: 0

Observations from Q-Q Plot, Skewness Coefficient, and Z-score

White Wine:

  • Poor Quality:
    • High skewness: free sulfur dioxide (4.48), chlorides (1.17).
    • Outliers (|Z| > 5): chlorides (1), free sulfur dioxide (1).
  • Middle Quality:
    • Skewness: citric acid (1.38), chlorides (1.22), density (1.01).
    • Most outliers: citric acid (8), chlorides (6), density (3), sulphates (2), free sulfur dioxide (2).
  • Good Quality:
    • Skewness: free sulfur dioxide (1.47), density (1.14), sulphates (0.98).
    • No outliers at this threshold.

Red Wine:

  • Poor Quality:
    • High skewness: sulphates (4.47), chlorides (2.34).
    • Outliers (|Z| > 5): sulphates (1).
  • Middle Quality:
    • Skewness: sulphates (2.31), residual sugar (1.84), chlorides (1.74).
    • Most outliers: chlorides (12), sulphates (7), residual sugar (6), total sulfur dioxide (2), free sulfur dioxide (1).
  • Good Quality:
    • Skewness: volatile acidity (1.72), free sulfur dioxide (1.49), sulphates (1.46).
    • No outliers at this threshold.

We create heatmaps to understand the relationship between the features and the wine quality score.

In [42]:
def create_corr_matrix(wine_type_df):
       numeric_columns = check_numeric_columns(wine_type_df)

       #print(f'Correlation matrix for {get_wine_str(wine_type_df)}')
       #print(wine_type_df[numeric_columns].corr())
       return wine_type_df[numeric_columns].corr()

cmap = sns.diverging_palette(230, 20, as_cmap=True)

def create_heat_map(wine_type_df):

       correlation_matrix = create_corr_matrix(wine_type_df)

       plt.figure(figsize=(15, 10))

       # Mask (hide) cells whose absolute correlation is below 0.05
       mask_low_corr = np.abs(correlation_matrix) < 0.05
       
       sns.heatmap(correlation_matrix, annot=True, cmap=cmap, mask = mask_low_corr, linewidths=.5, vmax=1, vmin=-1)

       plt.title(f'Correlation matrix for {get_wine_str(wine_type_df)}')
       plt.xticks(rotation=45)
       plt.yticks(rotation=45)
       plt.show()

def create_clustermap(wine_type_df):
       correlation_matrix = create_corr_matrix(wine_type_df)
       # Mask (hide) cells whose absolute correlation is below 0.05
       mask_low_corr = np.abs(correlation_matrix) < 0.05
       print(mask_low_corr)

       plt.figure(figsize=(15, 15))
       sns.clustermap(correlation_matrix, annot=True, mask = mask_low_corr, cmap= cmap,linewidths=.5, vmax=1, vmin=-1)

       plt.title(f'Correlation matrix for {get_wine_str(wine_type_df)}')
       plt.xticks(rotation=45)
       plt.yticks(rotation=45)
       plt.show()
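
A likely cause of a `ValueError` when `create_clustermap` is applied to a single quality subset is a constant column. The red "good" subset reports a quality skewness of 0, which suggests that quality takes a single value there; the correlation of a constant column is undefined (NaN), and hierarchical clustering rejects non-finite input. This is a diagnosis under that assumption, not a confirmed reading of the error. The sketch below reproduces the NaN on synthetic data and shows one possible workaround, dropping zero-variance columns before correlating:

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(7)
# Synthetic stand-in for a quality subset in which every row has the same score.
subset = pd.DataFrame({
    "alcohol": rng.normal(12.0, 1.0, size=30),
    "pH": rng.normal(3.3, 0.1, size=30),
    "sulphates": rng.normal(0.7, 0.1, size=30),
    "quality": 8,
})

# The constant column yields NaN correlations.
corr_with_constant = subset.corr()
has_nan = bool(corr_with_constant.isna().any().any())

# Workaround: keep only columns with more than one distinct value.
varying = subset.loc[:, subset.nunique() > 1]
clean_corr = varying.corr()

# Average linkage on Euclidean distances, the defaults used by seaborn.clustermap.
row_linkage = linkage(clean_corr.to_numpy(), method="average", metric="euclidean")
print(has_nan, clean_corr.shape, row_linkage.shape)
```

The cleaned matrix (and a mask computed from it) could then be passed to `sns.clustermap` in place of the original one.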
In [ ]:
create_plots(create_heat_map, wine_quality_dfs)
In [41]:
create_clustermap(df_red_good)
                      fixed acidity  volatile acidity  citric acid   
fixed acidity                 False             False        False  \
volatile acidity              False             False        False   
citric acid                   False             False        False   
residual sugar                False             False        False   
chlorides                     False             False        False   
free sulfur dioxide           False              True        False   
total sulfur dioxide          False             False        False   
density                       False             False        False   
pH                            False             False        False   
sulphates                     False             False        False   
alcohol                       False             False        False   
quality                       False             False        False   

                      residual sugar  chlorides  free sulfur dioxide   
fixed acidity                  False      False                False  \
volatile acidity               False      False                 True   
citric acid                    False      False                False   
residual sugar                 False      False                False   
chlorides                      False      False                False   
free sulfur dioxide            False      False                False   
total sulfur dioxide           False      False                False   
density                        False      False                False   
pH                             False      False                False   
sulphates                       True      False                False   
alcohol                        False      False                False   
quality                        False      False                False   

                      total sulfur dioxide  density     pH  sulphates   
fixed acidity                        False    False  False      False  \
volatile acidity                     False    False  False      False   
citric acid                          False    False  False      False   
residual sugar                       False    False  False       True   
chlorides                            False    False  False      False   
free sulfur dioxide                  False    False  False      False   
total sulfur dioxide                 False    False  False       True   
density                              False    False  False      False   
pH                                   False    False  False      False   
sulphates                             True    False  False      False   
alcohol                              False    False  False      False   
quality                              False    False  False      False   

                      alcohol  quality  
fixed acidity           False    False  
volatile acidity        False    False  
citric acid             False    False  
residual sugar          False    False  
chlorides               False    False  
free sulfur dioxide     False    False  
total sulfur dioxide    False    False  
density                 False    False  
pH                      False    False  
sulphates               False    False  
alcohol                 False    False  
quality                 False    False  
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[41], line 1
----> 1 create_clustermap(df_red_good)

Cell In[40], line 33, in create_clustermap(wine_type_df)
     30 print(mask_low_corr)
     32 plt.figure(figsize=(15, 15))
---> 33 sns.clustermap(correlation_matrix, annot=True, mask = mask_low_corr, cmap= cmap,linewidths=.5, vmax=1, vmin=-1)
     35 plt.title(f'Correlation matrix for {get_wine_str(wine_type_df)}')
     36 plt.xticks(rotation=45)

File /usr/local/lib/python3.11/site-packages/seaborn/matrix.py:1258, in clustermap(data, pivot_kws, method, metric, z_score, standard_scale, figsize, cbar_kws, row_cluster, col_cluster, row_linkage, col_linkage, row_colors, col_colors, mask, dendrogram_ratio, colors_ratio, cbar_pos, tree_kws, **kwargs)
   1250     raise RuntimeError("clustermap requires scipy to be available")
   1252 plotter = ClusterGrid(data, pivot_kws=pivot_kws, figsize=figsize,
   1253                       row_colors=row_colors, col_colors=col_colors,
   1254                       z_score=z_score, standard_scale=standard_scale,
   1255                       mask=mask, dendrogram_ratio=dendrogram_ratio,
   1256                       colors_ratio=colors_ratio, cbar_pos=cbar_pos)
-> 1258 return plotter.plot(metric=metric, method=method,
   1259                     colorbar_kws=cbar_kws,
   1260                     row_cluster=row_cluster, col_cluster=col_cluster,
   1261                     row_linkage=row_linkage, col_linkage=col_linkage,
   1262                     tree_kws=tree_kws, **kwargs)

File /usr/local/lib/python3.11/site-packages/seaborn/matrix.py:1129, in ClusterGrid.plot(self, metric, method, colorbar_kws, row_cluster, col_cluster, row_linkage, col_linkage, tree_kws, **kws)
   1125     kws.pop("square")
   1127 colorbar_kws = {} if colorbar_kws is None else colorbar_kws
-> 1129 self.plot_dendrograms(row_cluster, col_cluster, metric, method,
   1130                       row_linkage=row_linkage, col_linkage=col_linkage,
   1131                       tree_kws=tree_kws)
   1132 try:
   1133     xind = self.dendrogram_col.reordered_ind

File /usr/local/lib/python3.11/site-packages/seaborn/matrix.py:974, in ClusterGrid.plot_dendrograms(self, row_cluster, col_cluster, metric, method, row_linkage, col_linkage, tree_kws)
    970 def plot_dendrograms(self, row_cluster, col_cluster, metric, method,
    971                      row_linkage, col_linkage, tree_kws):
    972     # Plot the row dendrogram
    973     if row_cluster:
--> 974         self.dendrogram_row = dendrogram(
    975             self.data2d, metric=metric, method=method, label=False, axis=0,
    976             ax=self.ax_row_dendrogram, rotate=True, linkage=row_linkage,
    977             tree_kws=tree_kws
    978         )
    979     else:
    980         self.ax_row_dendrogram.set_xticks([])

File /usr/local/lib/python3.11/site-packages/seaborn/matrix.py:687, in dendrogram(data, linkage, axis, label, metric, method, rotate, tree_kws, ax)
    684 if _no_scipy:
    685     raise RuntimeError("dendrogram requires scipy to be installed")
--> 687 plotter = _DendrogramPlotter(data, linkage=linkage, axis=axis,
    688                              metric=metric, method=method,
    689                              label=label, rotate=rotate)
    690 if ax is None:
    691     ax = plt.gca()

File /usr/local/lib/python3.11/site-packages/seaborn/matrix.py:495, in _DendrogramPlotter.__init__(self, data, linkage, metric, method, axis, label, rotate)
    492 self.rotate = rotate
    494 if linkage is None:
--> 495     self.linkage = self.calculated_linkage
    496 else:
    497     self.linkage = linkage

File /usr/local/lib/python3.11/site-packages/seaborn/matrix.py:562, in _DendrogramPlotter.calculated_linkage(self)
    558         msg = ("Clustering large matrix with scipy. Installing "
    559                "`fastcluster` may give better performance.")
    560         warnings.warn(msg)
--> 562 return self._calculate_linkage_scipy()

File /usr/local/lib/python3.11/site-packages/seaborn/matrix.py:530, in _DendrogramPlotter._calculate_linkage_scipy(self)
    529 def _calculate_linkage_scipy(self):
--> 530     linkage = hierarchy.linkage(self.array, method=self.method,
    531                                 metric=self.metric)
    532     return linkage

File /usr/local/lib/python3.11/site-packages/scipy/cluster/hierarchy.py:1064, in linkage(y, method, metric, optimal_ordering)
   1061     raise ValueError("`y` must be 1 or 2 dimensional.")
   1063 if not np.all(np.isfinite(y)):
-> 1064     raise ValueError("The condensed distance matrix must contain only "
   1065                      "finite values.")
   1067 n = int(distance.num_obs_y(y))
   1068 method_code = _LINKAGE_METHODS[method]

ValueError: The condensed distance matrix must contain only finite values.
<Figure size 1500x1500 with 0 Axes>
In [26]:
# print(create_plots(create_corr_matrix, wine_quality_dfs))
In [28]:
create_plots(create_clustermap, wine_quality_dfs)
<Figure size 1500x1500 with 0 Axes>
<Figure size 1500x1500 with 0 Axes>
<Figure size 1500x1500 with 0 Axes>
<Figure size 1500x1500 with 0 Axes>
<Figure size 1500x1500 with 0 Axes>
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[28], line 1
----> 1 create_plots(create_clustermap, wine_quality_dfs)

Cell In[8], line 3, in create_plots(plot_function, list)
      1 def create_plots(plot_function, list = wine_lists):
      2     for i in list:
----> 3         plot_function(i)

Cell In[25], line 32, in create_clustermap(wine_type_df)
     29 mask_low_corr = np.abs(correlation_matrix) < 0.05
     31 plt.figure(figsize=(15, 15))
---> 32 sns.clustermap(correlation_matrix, annot=True, mask = mask_low_corr, cmap= cmap,linewidths=.5, vmax=1, vmin=-1)
     34 plt.title(f'Correlation matrix for {get_wine_str(wine_type_df)}')
     35 plt.xticks(rotation=45)

ValueError: The condensed distance matrix must contain only finite values.
<Figure size 1500x1500 with 0 Axes>
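
Note on the ValueError above: scipy's `linkage` raises it whenever the data handed to the clustering step contains non-finite values. One plausible cause, which is an assumption rather than something the notebook confirms, is that a filtered subset such as `df_red_good` contains a column with zero variance, so its Pearson correlation with every other column is NaN. The sketch below builds a correlation matrix with such columns removed; the helper name `finite_correlation_matrix` and the demo data are hypothetical.

```python
import numpy as np
import pandas as pd


def finite_correlation_matrix(df: pd.DataFrame) -> pd.DataFrame:
    """Return a correlation matrix without columns that produce NaN correlations.

    Hypothetical helper (not part of the original notebook).
    """
    numeric = df.select_dtypes(include="number")

    # A column with a single distinct value has zero variance,
    # which makes its Pearson correlation undefined (NaN).
    varying = numeric.loc[:, numeric.nunique(dropna=True) > 1]
    corr = varying.corr()

    # Safety net: drop any row/column that is still entirely NaN.
    return corr.dropna(axis=0, how="all").dropna(axis=1, how="all")


# Tiny made-up example: 'quality' is constant, as it could be in a filtered subset.
demo = pd.DataFrame({
    "alcohol": [9.4, 10.2, 11.0, 12.5],
    "pH": [3.5, 3.2, 3.3, 3.1],
    "quality": [7, 7, 7, 7],
})
corr = finite_correlation_matrix(demo)
print(corr)
```

If the result is finite, it can be passed to `sns.clustermap` in place of the raw correlation matrix, with the low-correlation mask recomputed from the cleaned matrix so that the two shapes match.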
In [ ]:
create_clustermap(df_red_wine)
In [ ]:
create_clustermap(log_white_wine_df)
In [ ]:
create_clustermap(log_red_wine_df)

Conclusions from the clustermaps

  • Features that are highly correlated with each other but only weakly correlated with quality are candidates for dimensionality reduction (we can use PCA to identify how many components we need).
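
The PCA idea mentioned above can be sketched as follows. This is a minimal illustration on synthetic data, not the notebook's actual pipeline: the helper name `reduce_with_pca`, the 95% variance threshold, and the demo matrix are all assumptions. Passing a float between 0 and 1 as `n_components` asks scikit-learn's `PCA` to keep the smallest number of components that explains at least that share of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def reduce_with_pca(X, variance_to_keep=0.95):
    """Standardize X, then keep enough principal components to explain
    at least `variance_to_keep` of the total variance.

    Hypothetical helper; the threshold is an illustrative choice.
    """
    # PCA is scale-sensitive, so put every feature on the same scale first.
    X_scaled = StandardScaler().fit_transform(X)
    pca = PCA(n_components=variance_to_keep)
    X_reduced = pca.fit_transform(X_scaled)
    return X_reduced, pca


# Synthetic stand-in for correlated features: three independent columns
# plus three noisy copies of them.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 3))
X_corr_demo = np.hstack([base, base + rng.normal(scale=0.05, size=(200, 3))])

X_reduced, pca = reduce_with_pca(X_corr_demo)
print("components kept:", X_reduced.shape[1])
print("cumulative explained variance:", pca.explained_variance_ratio_.cumsum())
```

For the wine data, the same function could be called on the feature columns only (for example `df_red_wine.drop(columns=['quality'])`), so that the target does not leak into the components.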

Selecting the most important features for quality prediction (top K features)¶

In [ ]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_selection import SelectKBest, f_classif

def visualize_feature_importance(datasets, names, k=5):
    """Plot normalized ANOVA F-test scores per feature and highlight the top k."""
    num_datasets = len(datasets)

    # One subplot per dataset; squeeze=False keeps `axes` 2-D even for a single dataset
    fig, axes = plt.subplots(1, num_datasets, figsize=(15, 5), sharey=True, squeeze=False)

    for ax, (X, y), name in zip(axes[0], datasets, names):
        # f_classif treats every distinct quality score as its own class
        selector = SelectKBest(score_func=f_classif, k=k).fit(X, y)

        # Clip p-values away from 0 so -log10 stays finite, then scale scores to [0, 1]
        pvalues = np.clip(selector.pvalues_, np.finfo(float).tiny, 1.0)
        scores = -np.log10(pvalues)
        scores /= scores.max()

        # Highlight the k features that SelectKBest keeps
        colors = ['tab:red' if kept else 'tab:blue' for kept in selector.get_support()]

        # Bar plot of the scores for this dataset
        feature_names = X.columns.tolist()
        ax.bar(range(len(feature_names)), scores, color=colors)
        ax.set_xticks(range(len(feature_names)))
        ax.set_xticklabels(feature_names, rotation=90)
        ax.set_xlabel('Feature')
        ax.set_ylabel('Normalized score (-log10 p-value)')
        ax.set_title(f'{name}: Feature Importance Scores (top {k} in red)')

    plt.tight_layout()
    plt.show()

# Example usage with two datasets (red wine and white wine)
datasets = [
    (df_red_wine.drop(columns=['quality']), df_red_wine['quality']),
    (df_white_wine.drop(columns=['quality']), df_white_wine['quality'])
]

visualize_feature_importance(datasets, ['Red Wine', 'White Wine'], k=5)
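
The plot shows the scores, but it can also help to read the selected feature names directly. The sketch below does that with `SelectKBest.get_support()`; the helper name `top_k_feature_names` and the synthetic demo data are hypothetical and only stand in for the wine DataFrames.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif


def top_k_feature_names(X: pd.DataFrame, y: pd.Series, k: int = 5) -> list:
    """Return the names of the k columns with the highest ANOVA F-scores.

    Hypothetical helper (not part of the original notebook).
    """
    # k may not exceed the number of available columns
    selector = SelectKBest(score_func=f_classif, k=min(k, X.shape[1])).fit(X, y)
    # get_support() is a boolean mask over the columns of X
    return X.columns[selector.get_support()].tolist()


# Synthetic example: one column tracks the target closely, two are pure noise.
rng = np.random.default_rng(0)
y_demo = pd.Series(rng.integers(0, 3, size=300), name="quality")
X_demo = pd.DataFrame({
    "informative": y_demo + rng.normal(scale=0.1, size=300),
    "noise_a": rng.normal(size=300),
    "noise_b": rng.normal(size=300),
})

print(top_k_feature_names(X_demo, y_demo, k=1))
```

With the notebook's data, the call would presumably look like `top_k_feature_names(df_red_wine.drop(columns=['quality']), df_red_wine['quality'], k=5)`.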

Visualize Data¶

  1. Quality distribution by wine type (color)
In [ ]:
def visualize_quality_histogram(datasets):
    """Plot the normalized quality-score distribution for each (DataFrame, name) pair."""
    num_datasets = len(datasets)

    # One subplot per dataset; squeeze=False keeps `axes` 2-D even for a single dataset
    fig, axes = plt.subplots(1, num_datasets, figsize=(15, 5), sharey=True, squeeze=False)

    for ax, (df, name) in zip(axes[0], datasets):
        quality_column = df['quality']
        # Quality is an integer score, so center one unit-wide bin on each value
        bins = np.arange(quality_column.min() - 0.5, quality_column.max() + 1.5)
        ax.hist(quality_column, bins=bins, edgecolor='black', density=True)
        ax.set_xlabel('Quality Score')
        ax.set_ylabel('Density')
        ax.set_title(f'{name} Quality Score Distribution')

    plt.tight_layout()
    plt.show()

# Example usage with two datasets (red wine and white wine)
datasets = [
    (df_red_wine, 'Red Wine'),
    (df_white_wine, 'White Wine')
]

visualize_quality_histogram(datasets)

Significant Statement¶

Conclusion & Discussions¶